Add real admission data pipeline and model calibration engine #1
Merged
YichengYang-Ethan merged 1 commit into main on Mar 15, 2026
- core/admission_data.py: CSV loader with GPA normalization (4/4.3/5/100 scales), background tier classification, internship scoring, and per-program statistics with feature importance analysis
- core/calibrator.py: Calibration engine that computes data-driven GPA thresholds, predicts outcomes, evaluates model accuracy, and generates school_ranker overrides
- Updated school_ranker to accept calibration overrides for data-driven reach/target/safety classification
- CLI: added 'stats' and 'calibrate' commands
- data/admissions/sample.csv: 30 sample records across 11 programs
- 45 new tests (218 total), all passing; ruff clean

https://claude.ai/code/session_014dkZ9Eq3DPVaUfRTeN2HXp
Pull request overview
This PR introduces a real admissions data pipeline and a calibration engine to derive data-driven thresholds/feature weights from historical outcomes, and integrates those thresholds into the school ranking flow via optional overrides.
Changes:
- Added core/admission_data.py to load/normalize admissions CSVs and compute per-program statistics (including feature importance).
- Added core/calibrator.py to compute calibrated program thresholds, evaluate prediction accuracy, and generate school_ranker override dictionaries.
- Extended CLI and school_ranker to consume/show calibration outputs; added sample/template CSVs and new tests for the new modules.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| core/admission_data.py | CSV loader + GPA normalization + background tiering + internship scoring + stats/feature importance. |
| core/calibrator.py | Calibration engine (thresholds, accuracy evaluation, recommendations) + ranker override generation. |
| core/school_ranker.py | Adds optional calibration_overrides and override-aware reach/target/safety classification. |
| cli/main.py | Adds stats and calibrate CLI commands to summarize data and run calibration. |
| data/admissions/template.csv | Adds admissions CSV header template. |
| data/admissions/sample.csv | Adds sample admissions dataset for demonstration/testing. |
| tests/test_admission_data.py | New unit tests for CSV loading, normalization, scoring, and stats computation. |
| tests/test_calibrator.py | New unit tests for calibration, prediction, and override generation. |
Comment on lines +80 to +108

    if scale == 4:
        return min(4.0, gpa)

    breakpoints = _GPA_SCALE_TO_4.get(scale)
    if breakpoints is None:
        # Unknown scale: attempt linear conversion
        return min(4.0, gpa * 4.0 / scale)

    for threshold, mapped_lo, mapped_hi in breakpoints:
        if gpa >= threshold:
            # Find the top of this segment
            # For the highest segment, cap at max GPA
            seg_top = scale if breakpoints[0] == (threshold, mapped_lo, mapped_hi) else threshold
            # Use previous segment's threshold as the top
            idx = breakpoints.index((threshold, mapped_lo, mapped_hi))
            if idx == 0:
                seg_top = scale
            else:
                seg_top = breakpoints[idx - 1][0]

            if seg_top == threshold:
                return mapped_hi

            frac = (gpa - threshold) / (seg_top - threshold)
            return mapped_lo + frac * (mapped_hi - mapped_lo)

    return 0.0
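
The segment lookup in the excerpt above assigns `seg_top` twice and searches the list with `breakpoints.index(...)` on every match. One possible simplification is to carry the previous threshold through the loop. This is a minimal sketch, not the PR's implementation: the breakpoint values in `GPA_SCALE_TO_4` are invented for illustration, the table layout (descending `(threshold, mapped_lo, mapped_hi)` tuples) is inferred from the excerpt, and clamping negative input to zero is an added assumption.

```python
from __future__ import annotations

# Hypothetical breakpoint table: scale -> [(threshold, mapped_lo, mapped_hi), ...],
# sorted by descending threshold. The numbers are illustrative only.
GPA_SCALE_TO_4: dict[float, list[tuple[float, float, float]]] = {
    100: [(90.0, 3.7, 4.0), (80.0, 3.0, 3.7), (60.0, 1.0, 3.0), (0.0, 0.0, 1.0)],
}


def normalize_gpa(gpa: float, scale: float) -> float:
    """Map a GPA on an arbitrary scale onto a 4.0 scale."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    gpa = max(0.0, min(gpa, scale))  # clamp out-of-range input (assumption)
    if scale == 4:
        return gpa

    breakpoints = GPA_SCALE_TO_4.get(scale)
    if breakpoints is None:
        # Unknown scale: fall back to a linear conversion.
        return gpa * 4.0 / scale

    seg_top = scale  # the highest segment is capped at the scale maximum
    for threshold, mapped_lo, mapped_hi in breakpoints:
        if gpa >= threshold:
            if seg_top == threshold:
                return mapped_hi
            # Interpolate linearly within [threshold, seg_top].
            frac = (gpa - threshold) / (seg_top - threshold)
            return mapped_lo + frac * (mapped_hi - mapped_lo)
        seg_top = threshold  # the next segment ends where this one starts
    return 0.0
```

With the illustrative table, a score of 85 on the 100-point scale falls halfway through the 80 to 90 segment and maps to roughly 3.35.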
Comment on lines +187 to +198

    # Count internships (Chinese: 段)
    for char in "段":
        count = desc.count(char)
        if count > 0:
            # Extract number before 段
            for i, c in enumerate(desc):
                if c == "段":
                    if i > 0 and desc[i - 1].isdigit():
                        n = int(desc[i - 1])
                        score += min(n * 1.5, 5.0)
                        break

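
The loop in the excerpt above reads only the single character before 段, so a description such as "12段" would be parsed as 2, and Chinese numerals such as "两段" are skipped. A regular expression can capture the whole number. This is a sketch under assumptions: `count_internships`, `internship_count_score`, and the numeral map `_CN_DIGITS` are hypothetical names, and only the 1.5-points-per-internship rule with a 5.0 cap is taken from the excerpt.

```python
import re

# Hypothetical mapping for common Chinese numerals; extend as needed.
_CN_DIGITS = {"一": 1, "两": 2, "二": 2, "三": 3, "四": 4, "五": 5,
              "六": 6, "七": 7, "八": 8, "九": 9, "十": 10}

# One or more digits, or a single Chinese numeral, directly before 段.
_SEGMENT_RE = re.compile(r"(\d+|[一两二三四五六七八九十])\s*段")


def count_internships(desc: str) -> int:
    """Return the internship count stated before the first 段, or 0 if none."""
    match = _SEGMENT_RE.search(desc)
    if match is None:
        return 0
    token = match.group(1)
    return int(token) if token.isdigit() else _CN_DIGITS[token]


def internship_count_score(desc: str) -> float:
    """Mirror the excerpt's rule: 1.5 points per internship, capped at 5.0."""
    return min(count_internships(desc) * 1.5, 5.0)
```

A word like 阶段 contains 段 without a preceding numeral, so it contributes nothing under this pattern.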
        for feat in feature_sums
    }
    total = sum(raw.values()) or 1.0
    return {feat: round(val / total, 3) for feat, val in sorted(raw.items(), key=lambda x: -x[1])}
Comment on lines +101 to +124

    threshold = ProgramThreshold(
        program_id=stats.program_id,
        sample_size=stats.total_records,
        confidence=_confidence_level(stats.total_records),
        observed_acceptance_rate=stats.observed_acceptance_rate,
    )

    if accepted:
        # GPA floor: minimum GPA among accepted applicants
        gpas_accepted = [r.gpa_normalized for r in accepted]
        threshold.gpa_floor = min(gpas_accepted)
        threshold.gpa_target = sum(gpas_accepted) / len(gpas_accepted)
        # Safe threshold: 90th percentile of accepted
        sorted_gpas = sorted(gpas_accepted)
        p90_idx = int(len(sorted_gpas) * 0.9)
        threshold.gpa_safe = sorted_gpas[min(p90_idx, len(sorted_gpas) - 1)]

        # Background tier
        threshold.max_bg_tier_accepted = max(r.bg_tier for r in accepted)

        # Intern score
        intern_scores = [r.intern_score for r in accepted]
        threshold.min_intern_score_accepted = min(intern_scores)
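
In the excerpt above, `int(len(sorted_gpas) * 0.9)` evaluates to index 9 for ten accepted records, which selects the maximum GPA instead of the 90th percentile. One common alternative is the nearest-rank definition, sketched below. The helper name `nearest_rank_percentile` is hypothetical and this is not necessarily the fix adopted in the PR.

```python
import math


def nearest_rank_percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # 1-based rank; multiply before dividing to limit floating-point drift.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


gpas = [3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9]
excerpt_value = sorted(gpas)[min(int(len(gpas) * 0.9), len(gpas) - 1)]  # 3.9, the maximum
nearest_rank_value = nearest_rank_percentile(gpas, 90)                  # 3.8
```

For small per-program samples the two definitions can differ by a full record, which matters when the result is used as a safety threshold.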
Comment on lines +215 to +218

    result_entry["calibrated"] = True
    result_entry["confidence"] = prog_overrides.get("confidence", "low")
    result_entry["sample_size"] = prog_overrides.get("sample_size", 0)

    from __future__ import annotations

    import csv
    import tempfile
Comment on lines +199 to +215

    # Quality keywords (Chinese + English)
    quality_keywords = {
        "顶级": 2.0, "top": 1.5, "百亿": 1.5, "头部": 1.5,
        "一线": 1.0, "知名": 0.8, "大型": 0.5,
    }
    for kw, pts in quality_keywords.items():
        if kw in desc:
            score += pts

    # Type keywords
    type_keywords = {
        "量化": 1.5, "quant": 1.5, "投行": 1.5, "ib": 1.0,
        "对冲": 1.5, "hedge": 1.5, "私募": 1.0, "qr": 1.0,
        "trading": 1.0, "研究": 0.8, "金工": 0.8,
        "三中一华": 2.0, "高盛": 2.0, "goldman": 2.0,
        "摩根": 2.0, "morgan": 1.5, "kaggle": 1.5,
    }
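
Plain substring checks such as `kw in desc` let short English keywords like "ib" or "qr" match inside unrelated words (for example "library"). A possible mitigation is to require token boundaries for short ASCII keywords while keeping substring matching for Chinese terms and longer stems such as "quant". The sketch below is an assumption-laden illustration: `keyword_score`, `_matches`, and the three-character cutoff are hypothetical, and only a subset of the excerpt's weights is reproduced.

```python
import re

# A subset of the weights from the excerpt above.
TYPE_KEYWORDS = {"量化": 1.5, "quant": 1.5, "投行": 1.5, "ib": 1.0, "qr": 1.0, "goldman": 2.0}

_SHORT_KEYWORD_LEN = 3  # hypothetical cutoff for boundary-checked keywords


def _matches(keyword: str, text: str) -> bool:
    """Match short ASCII keywords as whole tokens, everything else as a substring."""
    if keyword.isascii() and len(keyword) <= _SHORT_KEYWORD_LEN:
        # ASCII-only lookarounds, so "ib" still matches when surrounded by Chinese text.
        pattern = rf"(?<![a-z0-9]){re.escape(keyword)}(?![a-z0-9])"
        return re.search(pattern, text) is not None
    return keyword in text


def keyword_score(desc: str, weights: dict[str, float]) -> float:
    """Sum the weights of all keywords found in the lowercased description."""
    text = desc.lower()
    return sum(pts for kw, pts in weights.items() if _matches(kw, text))
```

ASCII lookarounds are used instead of `\b` because Python treats CJK characters as word characters, so `\b` would not fire between 盛 and "ib" in a mixed-language description.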
Comment on lines +184 to +190

    # Classification (with optional data-driven overrides).
    prog_overrides = overrides.get(prog.id)
    category = _classify(
        user_gpa=profile.gpa,
        program_avg_gpa=prog.avg_gpa,
        acceptance_rate=prog.acceptance_rate,
        overrides=prog_overrides,
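
The body of `_classify` is not shown in this excerpt. As a rough sketch of how calibration overrides could take precedence over a default heuristic, the function below uses the `reach_gpa_threshold` and `safety_gpa_threshold` keys that appear in the CLI output of this PR. Everything else is an assumption: the function name `classify`, the fallback band around the program average, and the `margin` default are hypothetical, and the `acceptance_rate` argument from the real call is omitted for brevity.

```python
from __future__ import annotations


def classify(
    user_gpa: float,
    program_avg_gpa: float,
    overrides: dict | None = None,
    margin: float = 0.2,  # hypothetical fallback margin, not taken from the PR
) -> str:
    """Return "reach", "target", or "safety" for one program."""
    if overrides:
        # Data-driven thresholds from calibration take precedence.
        reach = overrides["reach_gpa_threshold"]
        safe = overrides["safety_gpa_threshold"]
    else:
        # Fallback heuristic (an assumption): a band around the program average.
        reach = program_avg_gpa - margin
        safe = program_avg_gpa + margin
    if user_gpa < reach:
        return "reach"
    if user_gpa >= safe:
        return "safety"
    return "target"
```

The comparison directions follow the CLI output format in this PR (`reach<` and `safe>=`), so a GPA exactly at the safety threshold is classified as "safety".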
Comment on lines +779 to +785

    console.print(Panel("Ranker Overrides (Applied)", border_style="green"))
    for pid, ov in sorted(overrides.items()):
        console.print(
            f" {pid}: reach<{ov['reach_gpa_threshold']:.2f} "
            f"safe>={ov['safety_gpa_threshold']:.2f} "
            f"[dim](n={ov['sample_size']}, {ov['confidence']})[/dim]"
        )
Comment on lines +7 to +16

    from core.admission_data import AdmissionRecord
    from core.calibrator import (
        CalibrationResult,
        ProgramThreshold,
        calibrate_all,
        calibrate_program,
        generate_ranker_overrides,
        predict_outcome,
    )
    from core.admission_data import compute_program_stats